Abstract

We present a novel method for solving Canonical Correlation Anal- 
ysis (CCA) in a sparse convex framework using a least squares ap- 
proach. The presented method focuses on the scenario when one is 
interested in (or limited to) a primal representation for the first view 
while having a dual representation for the second view. Sparse CCA 
(SCCA) minimises the number of features used in both the primal 
and dual projections while maximising the correlation between the 
two views. The method is demonstrated on two paired corpora of
English-French and English-Spanish for mate-retrieval. We observe,
in the mate-retrieval task, that when the number of original
features is large SCCA outperforms Kernel CCA (KCCA), learning
the common semantic space from a sparse set of features.

1 Introduction 



Proposed by Hotelling (1936), CCA is a technique for finding pairs of vectors
that maximise the correlation between a set of paired variables. The set of
paired variables can be considered as two views of the same object, a
perspective we adopt throughout the paper. Since the debut of CCA, a
multitude of analyses, adaptations and applications have been proposed
(Kettenring, 1971; Fyfe & Lai, 2000a,b; Akaho, 2001; Friman, Carlsson,
Lundberg, Borga & Knutsson, 2001b; Friman, Borga, Lundberg & Knutsson,
2001a; Bach & Jordan, 2002; Hardoon & Shawe-Taylor, 2003; Hardoon, Szedmak
& Shawe-Taylor, 2004; Fukumizu, Bach & Gretton, 2007; Hardoon, Saunders,
Szedmak & Shawe-Taylor, 2006; Szedmak, Bie & Hardoon, 2007; Hardoon,
Mourao-Miranda, Brammer & Shawe-Taylor, 2007).

The potential disadvantage of CCA and similar statistical methods, such
as Principal Component Analysis (PCA) and Partial Least Squares (PLS),
is that the learned projections are a linear combination of all the features
in the primal and dual representations respectively. This makes the
interpretation of the solutions difficult. Studies by (Zou, Hastie &
Tibshirani, 2004; Moghaddam, Weiss & Avidan, 2006; Dhanjal, Gunn &
Shawe-Taylor, 2006) and the more recent (d'Aspremont, Ghaoui, Jordan &
Lanckriet, 2007; Sriperumbudur, Torres & Lanckriet, 2007) have addressed
this issue for PCA and PLS by learning only the relevant features that
maximise the variance for PCA and covariance for PLS. A previous
application of sparse CCA has been proposed in (Torres, Turnbull,
Barrington & Lanckriet, 2007) where the authors imposed sparsity on the
semantic space by penalising the cardinality of the solution vector
(Weston, Elisseeff, Schölkopf & Tipping, 2003). The SCCA presented in this
paper is novel to the extent that instead of working with covariance
matrices (Torres, Turnbull, Barrington & Lanckriet, 2007), which may be
computationally intensive to compute when the dimensionality of the data is
large, it deals directly with the training data.

In the Machine Learning (ML) community it is common practice to refer 
to the input space as the primal-representation and the kernel space as the 
dual-representation. In order to avoid confusion with the meanings of the 
terms primal and dual commonly used in the optimisation literature, we will 
use ML-primal to refer to the input space and ML-dual to refer to the ker- 
nel space for the remainder of the paper, though note that the references to 
primal and dual in the abstract refer to ML-primal and ML-dual. 

We introduce a new convex least squares variant of CCA which seeks a se- 
mantic projection that uses as few relevant features as possible to explain 
as much correlation as possible. In previous studies, CCA had either been 
formulated in the ML-primal (input) or ML-dual (kernel) representation for 
both views. These formulations, coupled with the need for sparsity, could 
prove insufficient when one desires or is limited to an ML primal-dual
representation, i.e. one wishes to learn the correlation of words in one
language that map to documents in another. We address these possible
scenarios by formulating SCCA in an ML primal-dual framework in which one
view is represented in the ML-primal and the other in the ML-dual (kernel
defined) representation. We compare SCCA with KCCA on bilingual
English-French and English-Spanish data-sets for a mate-retrieval task. We
show that in the mate-retrieval task SCCA performs as well as KCCA when the
number of original features is small and SCCA outperforms KCCA when the
number of original features is large. This emphasises SCCA's ability to
learn the semantic space from a small number of relevant features.

In Section 2 we give a brief review of CCA, and Section 3 formulates and de- 
fines SCCA. In Section 4 we derive our optimisation problem and show how 
all the pieces are assembled to give the complete algorithm. The experiments 
on the paired bilingual data-sets are given in Section 5. Section 6 concludes 
this paper. 

2 Canonical Correlation Analysis 

We briefly review canonical correlation analysis and its ML-dual (kernel) 
variant to provide a smooth understanding of the transition to the sparse 
formulation. First, we define the basic notation used in the paper:

b — boldface lower case letters represent vectors 
s — lower case letters represent scalars 
M — upper case letters represent matrices. 



The correlation between $x_a$ and $x_b$ can be computed as

$$\max_{\mathbf{w}_a,\mathbf{w}_b} \rho = \frac{\mathbf{w}_a' C_{ab} \mathbf{w}_b}{\sqrt{\mathbf{w}_a' C_{aa} \mathbf{w}_a \, \mathbf{w}_b' C_{bb} \mathbf{w}_b}}, \tag{1}$$

where $C_{aa} = X_a X_a'$ and $C_{bb} = X_b X_b'$ are the within-set covariance
matrices and $C_{ab} = X_a X_b'$ is the between-sets covariance matrix, $X_a$
is the matrix whose columns are the vectors $\mathbf{x}_i$,
$i = 1, \ldots, \ell$ from the first representation while $X_b$ is the matrix
with columns $\mathbf{x}_i$ from the second representation. We observe that
scaling $\mathbf{w}_a, \mathbf{w}_b$ does not affect the quotient in equation
(1), which is therefore equivalent to maximising
$\mathbf{w}_a' C_{ab} \mathbf{w}_b$ subject to
$\mathbf{w}_a' C_{aa} \mathbf{w}_a = \mathbf{w}_b' C_{bb} \mathbf{w}_b = 1$.

The kernelising of CCA (Fyfe & Lai, 2000a,b) offers an alternative by first
projecting the data into a higher dimensional feature space
$\phi_t : \mathbf{x} = (x_1, \ldots, x_n) \mapsto \phi_t(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_N(\mathbf{x}))$
$(N \geq n,\ t = a, b)$ before performing CCA in the new feature spaces. The
kernel variant of CCA is useful when the correlation is believed to exist
in some non-linear relationship. Given the kernel functions $\kappa_a$ and
$\kappa_b$ let $K_a = X_a' X_a$ and $K_b = X_b' X_b$ be the linear kernel
matrices corresponding to the two representations of the data, where $X_a$
is now the matrix whose columns are the vectors $\phi_a(\mathbf{x}_i)$,
$i = 1, \ldots, \ell$ from the first representation while $X_b$ is the matrix
with columns $\phi_b(\mathbf{x}_i)$ from the second representation. The
weights $\mathbf{w}_a$ and $\mathbf{w}_b$ can be expressed as a linear
combination of the training examples $\mathbf{w}_a = X_a \boldsymbol{\alpha}$
and $\mathbf{w}_b = X_b \boldsymbol{\beta}$. Substitution into the ML-primal
CCA equation (1) gives the optimisation

$$\max_{\boldsymbol{\alpha},\boldsymbol{\beta}} \rho = \frac{\boldsymbol{\alpha}' K_a K_b \boldsymbol{\beta}}{\sqrt{\boldsymbol{\alpha}' K_a^2 \boldsymbol{\alpha} \, \boldsymbol{\beta}' K_b^2 \boldsymbol{\beta}}},$$

which is equivalent to maximising
$\boldsymbol{\alpha}' K_a K_b \boldsymbol{\beta}$ subject to
$\boldsymbol{\alpha}' K_a^2 \boldsymbol{\alpha} = \boldsymbol{\beta}' K_b^2 \boldsymbol{\beta} = 1$.
This is the ML-dual form of the CCA optimisation problem given in equation
(1) which can be cast as a generalised eigenvalue problem and for which
the first $k$ generalised eigenvectors can be found efficiently. Both CCA and
KCCA can be formulated as symmetric eigenproblems.
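
As a rough numerical illustration of equation (1), the following sketch
computes the first canonical correlation of two centred views by whitening
each view and taking a singular value decomposition. It is a minimal
sketch and not the KCCA toolbox used in the paper; the function names
`inv_sqrt` and `first_canonical_pair`, the small ridge term `reg` and the
synthetic data are assumptions introduced here for illustration only.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def first_canonical_pair(Xa, Xb, reg=1e-8):
    """Xa (n_a x l) and Xb (n_b x l) hold one centred sample per column."""
    Caa = Xa @ Xa.T + reg * np.eye(Xa.shape[0])   # within-set covariance (ridge is an assumption)
    Cbb = Xb @ Xb.T + reg * np.eye(Xb.shape[0])
    Cab = Xa @ Xb.T                               # between-sets covariance
    Ra, Rb = inv_sqrt(Caa), inv_sqrt(Cbb)
    U, s, Vt = np.linalg.svd(Ra @ Cab @ Rb)       # whitened cross-covariance
    wa = Ra @ U[:, 0]
    wb = Rb @ Vt[0, :]
    return wa, wb, s[0]

# Synthetic paired views sharing one latent signal z (hypothetical data).
rng = np.random.default_rng(0)
l = 200
z = rng.standard_normal(l)
Xa = np.vstack([z + 0.1 * rng.standard_normal(l),
                rng.standard_normal(l)])
Xb = np.vstack([rng.standard_normal(l),
                -z + 0.1 * rng.standard_normal(l),
                rng.standard_normal(l)])
Xa -= Xa.mean(axis=1, keepdims=True)
Xb -= Xb.mean(axis=1, keepdims=True)

wa, wb, rho = first_canonical_pair(Xa, Xb)

# Evaluate the quotient of equation (1) directly for the returned weights.
quotient = (wa @ Xa @ Xb.T @ wb) / np.sqrt(
    (wa @ Xa @ Xa.T @ wa) * (wb @ Xb @ Xb.T @ wb))
```

The leading singular value `rho` and the directly evaluated `quotient`
should agree up to the effect of the ridge term, which is one way to check
that the returned weights do attain the correlation they report.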

A variety of theoretical analyses have been presented for CCA (Akaho, 2001;
Bach & Jordan, 2002; Fukumizu, Bach & Gretton, 2007; Hardoon, Szedmak &
Shawe-Taylor, 2004; Shawe-Taylor & Cristianini, 2004; Hardoon &
Shawe-Taylor, In Press). A common conclusion of some of these analyses
is the need to regularise KCCA. For example, the quality of the
generalisation of the associated pattern function is shown to be controlled
by the sum of the squares of the weight vector norms in (Hardoon &
Shawe-Taylor, In Press). Although there are advantages in using KCCA, which
have been demonstrated in various experiments across the literature, we
clarify that when using a linear kernel in both views, regularised KCCA is
the same as regularised CCA (since the former and latter are linear).
Nonetheless using KCCA with a linear kernel can have advantages over CCA,
the most important being speed when the number of features is larger than
the number of samples.^1



3 Sparse CCA 

The motivation for formulating a ML primal-dual SCCA is largely intuitive 
when faced with real-world problems combined with the need to understand 
or interpret the found solutions. Consider the following examples as potential 
case studies which would require ML primal-dual sparse multivariate analysis 
methods, such as the one proposed. 

• Enzyme prediction; in this problem one would like to uncover the re- 
lationship between the enzyme sequence, or more accurately the sub- 
sequences within each enzyme sequence that are highly correlated with 
the possible combination of the enzyme reactants. We would like to 
find a sparse ML-primal weight representation on the enzyme sequence 
which correlates highly to sparse ML-dual feature vector of the reac- 
tants. This will allow a better understanding of the enzyme structure 
relationship to reactions. 

• Bilingual analysis; when learning the semantic relationship between two 
languages, we may want to understand how one language maps from the
word space (ML-primal) to the contextual document (ML-dual) space
of another language. In both cases we do not want a complete mapping 
from all the words to all possible contexts but to be able to extract an 
interpretable relationship from a sparse word representation from one 
language to a particular and specific context (or sparse combination 
of) in the other language. 

• Brain analysis; here, one would be interested in finding a (ML-primal)
sparse voxel^2 activation map to some (ML-dual) non-linear stimulus
activation (such as musical sequences, images and various other
multidimensional input). The potential ability to find only the relevant
voxels in the stimuli would remove the particularly problematic issue
of thresholding the full voxel activation maps that are conventionally
generated.

^1 The KCCA toolbox used was from the code section
http://academic.davidroihardoon.com/

^2 A voxel is a pixel representing the smallest three-dimensional point
volume referenced in an fMRI (functional magnetic resonance imaging) image
of the brain. It is usually approximately 3mm x 3mm.



For the scope of this paper we limit ourselves to experiments with the bilin- 
gual texts problems. 

Throughout the paper we only consider the setting when one is interested in 
a ML-primal representation for the first view and a ML-dual representation 
for the second view, although it is easily shown that the given derivations 
hold for the inverted case (i.e. a ML-dual representation for the first view and 
a ML-primal representation for the second view) which is therefore omitted. 

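Consider a sample from a pair of random vectors (i.i.d. assumptions hold)
of the form $(\mathbf{x}_a^i, \mathbf{x}_b^i)$, each with zero mean (i.e.
centred), where $i = 1, \ldots, \ell$. Let $X_a$ and $X_b$ be matrices whose
columns are the corresponding training samples and let $K_b = X_b' X_b$ be
the kernel matrix of the second view and $\mathbf{w}_b$ be expressed as a
linear combination of the training examples $\mathbf{w}_b = X_b \mathbf{e}$
(note that $\mathbf{e}$ is a general vector and should not be confused with
notation sometimes used for unit coordinate vectors). The primal-dual CCA
problem can be expressed as a primal-dual Rayleigh quotient

$$
\begin{aligned}
\rho &= \max_{\mathbf{w}_a,\mathbf{w}_b} \frac{\mathbf{w}_a' X_a X_b' \mathbf{w}_b}{\sqrt{\mathbf{w}_a' X_a X_a' \mathbf{w}_a \, \mathbf{w}_b' X_b X_b' \mathbf{w}_b}} \\
&= \max_{\mathbf{w}_a,\mathbf{e}} \frac{\mathbf{w}_a' X_a X_b' X_b \mathbf{e}}{\sqrt{\mathbf{w}_a' X_a X_a' \mathbf{w}_a \, \mathbf{e}' X_b' X_b X_b' X_b \mathbf{e}}} \\
&= \max_{\mathbf{w}_a,\mathbf{e}} \frac{\mathbf{w}_a' X_a K_b \mathbf{e}}{\sqrt{\mathbf{w}_a' X_a X_a' \mathbf{w}_a \, \mathbf{e}' K_b^2 \mathbf{e}}},
\end{aligned}
\tag{2}
$$

where we choose the primal weights $\mathbf{w}_a$ of the first representation
and dual features $\mathbf{e}$ of the second representation such that the
correlation $\rho$ between the two vectors is maximised. As we are able to
scale $\mathbf{w}_a$ and $\mathbf{e}$ without changing the quotient, the
maximisation in equation (2) is equal to maximising
$\mathbf{w}_a' X_a K_b \mathbf{e}$ subject to
$\mathbf{w}_a' X_a X_a' \mathbf{w}_a = \mathbf{e}' K_b^2 \mathbf{e} = 1$. For
simplicity let $X = X_a$, $\mathbf{w} = \mathbf{w}_a$ and $K = K_b$.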
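
To make the scale invariance of the quotient in equation (2) concrete, the
sketch below evaluates it for arbitrary $\mathbf{w}$ and $\mathbf{e}$ and
compares it with the cosine of the angle between $X'\mathbf{w}$ and
$K\mathbf{e}$. This is only an illustration under assumed random data; the
function name `primal_dual_quotient` and all numerical values are
hypothetical.

```python
import numpy as np

def primal_dual_quotient(X, K, w, e):
    """Quotient of equation (2): w'XKe / sqrt(w'XX'w * e'K^2 e)."""
    num = w @ X @ K @ e
    den = np.sqrt((w @ X @ X.T @ w) * (e @ K @ K @ e))
    return num / den

rng = np.random.default_rng(1)
m, l = 6, 10
X = rng.standard_normal((m, l))     # first view, one sample per column
Xb = rng.standard_normal((4, l))    # second view, used only to build K
K = Xb.T @ Xb                       # linear kernel of the second view
w = rng.standard_normal(m)
e = rng.standard_normal(l)

rho = primal_dual_quotient(X, K, w, e)
rho_scaled = primal_dual_quotient(X, K, 3.0 * w, 0.5 * e)  # positive rescaling

# Cosine of the angle between the two projected vectors X'w and Ke.
a, b = X.T @ w, K @ e
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Because $K$ is symmetric, $\mathbf{e}'K^2\mathbf{e} = \|K\mathbf{e}\|^2$, so
`rho`, `rho_scaled` and `cosine` should coincide up to rounding.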



Having provided the initial primal-dual framework we proceed to reformulate
the problem as a convex sparse least squares optimisation problem. We are
able to show that maximising the correlation between the two vectors
$K\mathbf{e}$ and $X'\mathbf{w}$ can be viewed as minimising the angle between
them. Since the angle is invariant to rescaling, we can fix the scaling of
one vector and then minimise the norm of the difference between the two
vectors^3

$$\min_{\mathbf{w},\mathbf{e}} \|X'\mathbf{w} - K\mathbf{e}\|^2 \tag{3}$$

subject to $\|K\mathbf{e}\|^2 = 1$. This intuition is formulated in the
following theorem,

Theorem 1 Vectors $\mathbf{w}, \mathbf{e}$ are an optimal solution of
equation (2) if and only if there exist $\mu, \gamma$ such that
$\mu\mathbf{w}, \gamma\mathbf{e}$ are an optimal solution of equation (3).

Theorem 1 is well known in the statistics community and corresponds to the
equivalence between one form of Alternating Conditional Expectation (ACE)
and CCA (Breiman & Friedman, 1985; Hastie & Tibshirani, 1990). For an
exact proof see Theorem 5.1 on page 590 in (Breiman & Friedman, 1985).

Constraining the 2-norm of $K\mathbf{e}$ (or $X'\mathbf{w}$) will result in a
non-convex problem, i.e. we will not obtain a positive/negative-definite
Hessian matrix. We are motivated by the Rayleigh quotient solution for
optimising CCA, whose resulting symmetric eigenproblem does not enforce the
$\|K\mathbf{e}\|^2 = 1$ constraint, i.e. the optimal solution is invariant to
rescaling of the solutions. Therefore we replace the scaling of
$\|K\mathbf{e}\|^2 = 1$ with the scaling of $\mathbf{e}$ to be
$\|\mathbf{e}\|_\infty = 1$. We will address the resulting convexity when we
achieve the final formulation.

After finding an optimal CCA solution, we are able to re-normalise
$\mathbf{e}$ so that $\|K\mathbf{e}\|^2 = 1$ holds. We emphasise that even
though $K$ has been removed from the constraint the link to kernels (kernel
tricks and RKHS) is represented in the choice of kernel $K$ used for the
dual-view, otherwise the presented method is a sparse linear CCA^4. We can
now focus on obtaining an optimal sparse solution for
$\mathbf{w}, \mathbf{e}$.

It is obvious that when starting with $\mathbf{w} = \mathbf{e} = \mathbf{0}$
further minimising is impossible. To avoid this trivial solution and to
ensure that the constraints hold in our starting condition^5 we set
$\|\mathbf{e}\|_\infty = 1$ by fixing $e_k = 1$ for some fixed index
$1 \leq k \leq \ell$ so that
$\mathbf{e} = [e_1, \ldots, e_{k-1}, e_k, e_{k+1}, \ldots, e_\ell]$. To further
obtain a sparse solution on $\mathbf{e}$ we constrain the 1-norm of the
remaining coefficients $\|\tilde{\mathbf{e}}\|_1$, where we define
$\tilde{\mathbf{e}} = [e_1, \ldots, e_{k-1}, e_{k+1}, \ldots, e_\ell]$. The
motivation behind isolating a specific $k$ and constraining the 1-norm of the
remaining coefficients, other than ensuring a non-trivial solution, follows
the intuition of wanting to find similarities between the samples given some
basis for comparison. In the case of documents, this places the chosen
document (indexed by $k$) in a semantic context defined by an additional
(sparse) set of documents. This captures our previously stated goal of
wanting to be able to extract an interpretable relationship from a sparse
word representation from one language to a particular and specific context
in the other language. The $j \in \{1, \ldots, \ell\}$ choices of $k$
correspond to the $\mathbf{e}_j, \mathbf{w}_j$ projection vectors. We discuss
the selection of $k$ and the ensuring of orthogonality of the sparse
projections in Section 4.2.

^3 We define $\|\cdot\|$ to be the 2-norm.

^4 One should keep in mind that even kernel CCA is still linear CCA
performed in kernel defined feature space.

^5 $\|\mathbf{e}\|_\infty = \max(|e_1|, \ldots, |e_\ell|) = 1$, therefore there
must be at least one $e_i$ for some $i$ that is equal to 1.

We are also now able to constrain the 1-norm of $\mathbf{w}$ without
affecting the convexity of the problem. This gives the final optimisation as

$$\min_{\mathbf{w},\mathbf{e}} \|X'\mathbf{w} - K\mathbf{e}\|^2 + \mu\|\mathbf{w}\|_1 + \gamma\|\tilde{\mathbf{e}}\|_1 \tag{4}$$

subject to $\|\mathbf{e}\|_\infty = 1$. The expression
$\|X'\mathbf{w} - K\mathbf{e}\|^2$ is quadratic in the variables $\mathbf{w}$
and $\mathbf{e}$ and is bounded from below ($\geq 0$) and hence is convex,
since it can be expressed as
$\|X'\mathbf{w} - K\mathbf{e}\|^2 = C + \mathbf{g}'\mathbf{w} + \mathbf{f}'\mathbf{e} + [\mathbf{w}'\ \mathbf{e}'] H [\mathbf{w}'\ \mathbf{e}']'$.
If $H$ were not positive semi-definite, taking a multiple $t$ of the
eigenvector $\mathbf{v}' = [\mathbf{v}_1'\ \mathbf{v}_2']$ with negative
eigenvalue $\lambda$ would give
$C + t\mathbf{g}'\mathbf{v}_1 + t\mathbf{f}'\mathbf{v}_2 + t^2\lambda$,
creating arbitrarily large negative values. When minimising subject to
linear constraints (1-norms are linear) this makes the whole optimisation
convex.

While equation (4) is similar to the Least Absolute Shrinkage and Selection
Operator (LASSO) (Tibshirani, 1994)^6, it is not a standard LASSO problem
unless $\mathbf{e}$ is fixed. The problem in equation (4) could be considered
as a double-barreled LASSO where we are trying to find sparse solutions for
both $\mathbf{w}, \mathbf{e}$.
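
For concreteness, the objective of equation (4) can be evaluated as in the
sketch below. It only computes the value of the objective and does not
solve the optimisation; the function name `scca_objective`, the argument
order and the toy numbers are assumptions made for this illustration.

```python
import numpy as np

def scca_objective(X, K, w, e, k, mu, gamma):
    """||X'w - Ke||^2 + mu*||w||_1 + gamma*||e~||_1, where e~ is e without entry k."""
    residual = X.T @ w - K @ e
    e_rest = np.delete(e, k)              # the coefficients other than e_k
    return (residual @ residual
            + mu * np.abs(w).sum()
            + gamma * np.abs(e_rest).sum())

# Toy example with m = 2 features and l = 2 samples (hypothetical values).
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
K = np.eye(2)
w = np.array([1.0, 0.0])
e = np.array([1.0, 0.5])                  # e_k = 1 for k = 0

value = scca_objective(X, K, w, e, k=0, mu=2.0, gamma=4.0)
```

Here the residual is $[0, -0.5]$, so the squared error contributes 0.25,
the penalty on $\mathbf{w}$ contributes 2 and the penalty on the remaining
coefficient of $\mathbf{e}$ contributes 2.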

^6 Basis Pursuit Denoising (Chen, Donoho & Saunders, 1999)

4 Derivation & Algorithm



We propose a novel method for solving the optimisation problem represented
in equation (4), where the suggested algorithm minimises the gap between
the primal and dual Lagrangian solutions using a greedy search on
$\mathbf{w}, \mathbf{e}$. The proposed algorithm finds sparse vectors
$\mathbf{w}, \mathbf{e}$ by iteratively solving between the ML-primal and
dual formulation in turn. We give the proposed algorithm as the following
high-level pseudo-code. A more complete description will follow later:

• Repeat

1. Use the dual Lagrangian variables to solve the ML-primal variables

2. Check whether all constraints on ML-primal variables hold

3. Use ML-primal variables to solve the dual Lagrangian variables

4. Check whether all dual Lagrangian variable constraints hold

5. Check whether 2. holds, IF not go to 1.

• End

We have yet to address how to determine which elements in
$\mathbf{w}, \mathbf{e}$ are to be non-zero. We will show that from the
derivation given in Section 4.1 a lower and upper bound is computed.
Combining the bound with the constraints provides us with a criterion for
selecting the non-zero elements for both $\mathbf{w}$ and $\mathbf{e}$. The
criterion is that only the respective indices which violate the bound and
the various constraints need to be updated.

We proceed to give the derivation of our problem. The minimisation



$$\min_{\mathbf{w},\mathbf{e}} \|X'\mathbf{w} - K\mathbf{e}\|^2 + \mu\|\mathbf{w}\|_1 + \gamma\|\tilde{\mathbf{e}}\|_1$$

subject to $\|\mathbf{e}\|_\infty = 1$ can be written as

$$\mathbf{w}'XX'\mathbf{w} + \mathbf{e}'K^2\mathbf{e} - 2\mathbf{w}'XK\mathbf{e} + \mu\|\mathbf{w}\|_1 + \gamma\|\tilde{\mathbf{e}}\|_1$$

subject to $\|\mathbf{e}\|_\infty = 1$, where $\mu, \gamma$ are fixed positive
parameters.

To simplify our mathematical notation we revert to uniformly using
$\mathbf{e}$ in place of $\tilde{\mathbf{e}}$, as $k$ will be fixed in an outer
loop so that the only requirement is that no update will be made for $e_k$,
which can be enforced in the actual algorithm. We further emphasise that we
are only interested in the positive spectrum of $\mathbf{e}$, which again can
be easily enforced by updating any $e_i < 0$ to be $e_i = 0$^7. Therefore we
could rewrite the constraint $\|\mathbf{e}\|_\infty = 1$ as
$0 \leq e_i \leq 1, \forall i \in \{1, \ldots, \ell\}$.

We are able to obtain the corresponding Lagrangian

$$\mathcal{L} = \mathbf{w}'XX'\mathbf{w} + \mathbf{e}'K^2\mathbf{e} - 2\mathbf{w}'XK\mathbf{e} + \mu\|\mathbf{w}\|_1 + \gamma\mathbf{e}'\mathbf{j} - \boldsymbol{\beta}'\mathbf{e},$$

subject to

$$\boldsymbol{\beta} \geq 0,$$

where $\boldsymbol{\beta}$ is the dual Lagrangian variable on $\mathbf{e}$ and
$\mu, \gamma$ are positive scale factors as discussed in Theorem 1 and
$\mathbf{j}$ is the all ones vector. We note that as we algorithmically
ensure that $\mathbf{e} \geq 0$ we are able to write
$\gamma\|\mathbf{e}\|_1 = \gamma\mathbf{e}'\mathbf{j}$.

The constants $\mu, \gamma$ can also be considered as the hyper-parameters
(or regularisation parameters) common in the LASSO literature, controlling
the trade-off between the function objective and the level of sparsity. We
show that the scale parameters can be treated as a type of dual Lagrangian
parameters to provide an underlying automatic determination of sparsity.
This potentially sub-optimal setting still obtains very good results and is
discussed in Section 5.1.

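To simplify the 1-norm derivation we express $\mathbf{w}$ by its positive and
negative components^8 such that $\mathbf{w} = \mathbf{w}^+ - \mathbf{w}^-$
subject to $\mathbf{w}^+, \mathbf{w}^- \geq 0$. We limit ourselves to positive
entries in $\mathbf{e}$ as we expect to align with a positive subset of
articles.

This allows us to rewrite the Lagrangian as

$$
\begin{aligned}
\mathcal{L} ={}& (\mathbf{w}^+ - \mathbf{w}^-)'XX'(\mathbf{w}^+ - \mathbf{w}^-) + \mathbf{e}'K^2\mathbf{e} \\
& - 2(\mathbf{w}^+ - \mathbf{w}^-)'XK\mathbf{e} - (\boldsymbol{\alpha}^-)'\mathbf{w}^- - (\boldsymbol{\alpha}^+)'\mathbf{w}^+ - \boldsymbol{\beta}'\mathbf{e} \\
& + \gamma(\mathbf{e}'\mathbf{j}) + \mu\big((\mathbf{w}^+ + \mathbf{w}^-)'\mathbf{j}\big).
\end{aligned}
\tag{5}
$$

The corresponding Lagrangian in equation (5) is subject to

$$\boldsymbol{\alpha}^+ \geq 0, \qquad \boldsymbol{\alpha}^- \geq 0, \qquad \boldsymbol{\beta} \geq 0.$$

The two new dual Lagrangian variables
$\boldsymbol{\alpha}^+, \boldsymbol{\alpha}^-$ are to uphold the positivity
constraints on $\mathbf{w}^+, \mathbf{w}^-$.

^7 We can also easily enforce the $\|\cdot\|_\infty$ constraint by updating
any $e_i > 1$ to be $e_i = 1$.

^8 This means that $\mathbf{w}^+$/$\mathbf{w}^-$ will only have the
positive/negative values of $\mathbf{w}$ and zero elsewhere.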



4.1 SCCA Derivation

In this section we will show that the constraints on the dual Lagrangian
variables will form the criterion for selecting the non-zero elements from
$\mathbf{w}$ and $\mathbf{e}$. First we define further notations used. Given
the data matrix $X \in \mathbb{R}^{m \times \ell}$ and Kernel matrix
$K \in \mathbb{R}^{\ell \times \ell}$ as defined in Section 3, we define
the following vectors

$$
\begin{aligned}
\mathbf{w}^+ &= [w_1^+, \ldots, w_m^+] \\
\boldsymbol{\alpha}^+ &= [\alpha_1^+, \ldots, \alpha_m^+] \\
\mathbf{e} &= [e_1, \ldots, e_\ell]
\end{aligned}
$$

Throughout this section let $i$ be the index of either $\mathbf{w}$ or
$\mathbf{e}$ that needs to be updated. We use the notation $(\cdot)_i$ or
$[\cdot]_i$ to refer to the $i$th index within a vector and $(\cdot)_{ii}$ to
refer to the $i$th element on the diagonal of a matrix.



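Taking derivatives of equation (5) with respect to
$\mathbf{w}^+, \mathbf{w}^-, \mathbf{e}$ and equating to zero gives

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}^+} &= 2XX'(\mathbf{w}^+ - \mathbf{w}^-) - 2XK\mathbf{e} - \boldsymbol{\alpha}^+ + \mu\mathbf{j} = 0 \\
\frac{\partial \mathcal{L}}{\partial \mathbf{w}^-} &= -2XX'(\mathbf{w}^+ - \mathbf{w}^-) + 2XK\mathbf{e} - \boldsymbol{\alpha}^- + \mu\mathbf{j} = 0 \\
\frac{\partial \mathcal{L}}{\partial \mathbf{e}} &= 2K^2\mathbf{e} - 2KX'\mathbf{w} - \boldsymbol{\beta} + \gamma\mathbf{j} = 0,
\end{aligned}
\tag{6}
$$

adding the first two equations gives

$$
\begin{aligned}
\boldsymbol{\alpha}^+ &= 2\mu\mathbf{j} - \boldsymbol{\alpha}^- \\
\boldsymbol{\alpha}^- &= 2\mu\mathbf{j} - \boldsymbol{\alpha}^+
\end{aligned}
$$

implying a lower and upper component-wise bound on
$\boldsymbol{\alpha}^-, \boldsymbol{\alpha}^+$ of

$$
\begin{aligned}
0 \leq \boldsymbol{\alpha}^- \leq 2\mu\mathbf{j} \\
0 \leq \boldsymbol{\alpha}^+ \leq 2\mu\mathbf{j}.
\end{aligned}
$$

We use the bound on $\boldsymbol{\alpha}$ to indicate which indices of the
vector $\mathbf{w}$ need to be updated by only updating the $w_i$'s whose
corresponding $\alpha_i$ violates the bound. Similarly, we only update the
$e_i$ that has a corresponding $\beta_i$ value smaller than 0.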
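
The selection criterion can be sketched as follows: compute
$\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ from equation (6) and keep the
indices that violate the bounds. This is a minimal sketch of the index
selection only, using $\boldsymbol{\alpha}^-$ as $\boldsymbol{\alpha}$; the
function name `violating_indices` and the toy matrices are assumptions
introduced here.

```python
import numpy as np

def violating_indices(X, K, w, e, mu, gamma):
    """Return alpha, beta and the index sets I, J that violate the bounds."""
    # alpha^- from equation (6): alpha = 2XKe - 2XX'w + mu*j
    alpha = 2 * X @ K @ e - 2 * X @ (X.T @ w) + mu
    # beta from equation (6): beta = 2K^2 e - 2KX'w + gamma*j
    beta = 2 * K @ (K @ e) - 2 * K @ (X.T @ w) + gamma
    I = np.flatnonzero((alpha < 0) | (alpha > 2 * mu))   # entries of w to update
    J = np.flatnonzero(beta < 0)                         # entries of e to update
    return alpha, beta, I, J

# Toy example: m = 2 features, l = 3 samples, starting from w = 0, e_k = 1.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
K = np.eye(3)
w = np.zeros(2)
e = np.array([1.0, 0.0, 0.0])

alpha, beta, I, J = violating_indices(X, K, w, e, mu=1.0, gamma=1.0)
```

In this toy case $\boldsymbol{\alpha} = [3, 1]$, so only the first weight
lies outside $[0, 2\mu]$ and is selected, while no entry of
$\boldsymbol{\beta}$ is negative.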

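We are able to rewrite the derivative with respect to $\mathbf{w}^+$ in terms
of $\boldsymbol{\alpha}^-$

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}^+} &= 2XX'(\mathbf{w}^+ - \mathbf{w}^-) - 2XK\mathbf{e} - 2\mu\mathbf{j} + \boldsymbol{\alpha}^- + \mu\mathbf{j} \\
&= 2XX'(\mathbf{w}^+ - \mathbf{w}^-) - 2XK\mathbf{e} - \mu\mathbf{j} + \boldsymbol{\alpha}^-.
\end{aligned}
$$

We wish to compute the update rule for the selected indices of $\mathbf{w}$.
Taking the second derivatives of equation (5) with respect to $\mathbf{w}^+$
and $\mathbf{w}^-$ gives

$$\frac{\partial^2 \mathcal{L}}{\partial (\mathbf{w}^+)^2} = 2XX', \qquad \frac{\partial^2 \mathcal{L}}{\partial (\mathbf{w}^-)^2} = 2XX',$$

so for $\mathbf{i}_i$, the unit vector with entry 1 in position $i$, we have an
exact Taylor series expansion $t^+$ and $t^-$ respectively for $w_i^+$ and
$w_i^-$ as

$$
\begin{aligned}
\mathcal{L}(\mathbf{w}^+ + t^+\mathbf{i}_i) &= \mathcal{L}(\mathbf{w}^+) + \frac{\partial \mathcal{L}}{\partial w_i^+}t^+ + \frac{\partial^2 \mathcal{L}}{\partial (w_i^+)^2}(t^+)^2 \\
\mathcal{L}(\mathbf{w}^- + t^-\mathbf{i}_i) &= \mathcal{L}(\mathbf{w}^-) + \frac{\partial \mathcal{L}}{\partial w_i^-}t^- + \frac{\partial^2 \mathcal{L}}{\partial (w_i^-)^2}(t^-)^2
\end{aligned}
$$

giving us the exact update for $w_i^+$ by setting

$$
\begin{aligned}
\frac{\partial \mathcal{L}(\mathbf{w}^+ + t^+\mathbf{i}_i)}{\partial t^+} &= (2XX'(\mathbf{w}^+ - \mathbf{w}^-) - 2XK\mathbf{e} - \boldsymbol{\alpha}^+ + \mu\mathbf{j})_i + 4(XX')_{ii}t^+ = 0 \\
\Rightarrow\ t^+ &= \frac{1}{4(XX')_{ii}}\left[2XK\mathbf{e} - 2XX'(\mathbf{w}^+ - \mathbf{w}^-) - \boldsymbol{\alpha}^- + \mu\mathbf{j}\right]_i.
\end{aligned}
$$

Therefore the update for $w_i^+$ is $\Delta w_i^+ = t^+$. We also compute the
exact update for $w_i^-$ as

$$
\begin{aligned}
\frac{\partial \mathcal{L}(\mathbf{w}^- + t^-\mathbf{i}_i)}{\partial t^-} &= (-2XX'(\mathbf{w}^+ - \mathbf{w}^-) + 2XK\mathbf{e} - \boldsymbol{\alpha}^- + \mu\mathbf{j})_i + 4(XX')_{ii}t^- = 0 \\
\Rightarrow\ t^- &= -\frac{1}{4(XX')_{ii}}\left[2XK\mathbf{e} - 2XX'(\mathbf{w}^+ - \mathbf{w}^-) - \boldsymbol{\alpha}^- + \mu\mathbf{j}\right]_i,
\end{aligned}
$$

so that the update for $w_i^-$ is $\Delta w_i^- = t^-$. Recall that
$\mathbf{w} = (\mathbf{w}^+ - \mathbf{w}^-)$, hence the update rule for $w_i$ is

$$w_i \leftarrow w_i + (\Delta w_i^+ - \Delta w_i^-).$$

Therefore we find that the new value of $w_i$ should be

$$w_i \leftarrow w_i + \frac{1}{2(XX')_{ii}}\left[2XK\mathbf{e} - 2XX'\mathbf{w} - \boldsymbol{\alpha}^- + \mu\mathbf{j}\right]_i.$$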

We must also consider the update of $w_i$ when $\alpha_i$ is within the
constraints and $w_i \neq 0$, i.e. previously $\alpha_i$ had violated the
constraints, triggering the update of $w_i$ to be non-zero. Notice from
equation (6) that

$$2(XX')_{ii}w_i + 2\sum_{j \neq i}(XX')_{ij}w_j = 2(XK\mathbf{e})_i - \alpha_i + \mu.$$

It is easy to observe that the only component which can change is
$2(XX')_{ii}w_i$, therefore we need to update $w_i$ towards zero. Hence when
$w_i > 0$ the absolute value of the update is

$$2(XX')_{ii}\Delta w_i = 2\mu - \alpha_i, \qquad \Delta w_i = \frac{2\mu - \alpha_i}{2(XX')_{ii}},$$

else when $w_i < 0$ then the update is the negation of

$$2(XX')_{ii}\Delta w_i = 0 - \alpha_i, \qquad \Delta w_i = \frac{0 - \alpha_i}{2(XX')_{ii}},$$

so that the update rule is $w_i \leftarrow w_i - \Delta w_i$. In the updating
of $w_i$ we ensure that $w_i$ and its updated value $\hat{w}_i$ do not have
opposite signs, i.e. we will always stop at zero before updating in any new
direction.

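We continue by taking second derivatives of the Lagrangian in equation (5)
with respect to $\mathbf{e}$, which gives

$$\frac{\partial^2 \mathcal{L}}{\partial \mathbf{e}^2} = 2K^2,$$

so for $\mathbf{i}_i$, the unit vector with entry 1 in position $i$, we have an
exact Taylor series expansion

$$\mathcal{L}(\mathbf{e} + t\mathbf{i}_i) = \mathcal{L}(\mathbf{e}) + \frac{\partial \mathcal{L}}{\partial e_i}t + \frac{\partial^2 \mathcal{L}}{\partial e_i^2}(t)^2$$

giving us the following update rule for $e_i$

$$
\begin{aligned}
\frac{\partial \mathcal{L}(\mathbf{e} + t\mathbf{i}_i)}{\partial t} &= (2K^2\mathbf{e} - 2KX'\mathbf{w} - \boldsymbol{\beta} + \gamma\mathbf{j})_i + 4K^2_{ii}t = 0 \\
\Rightarrow\ t &= \frac{1}{4K^2_{ii}}\left[2KX'\mathbf{w} - 2K^2\mathbf{e} + \boldsymbol{\beta} - \gamma\mathbf{j}\right]_i,
\end{aligned}
$$

the update for $\mathbf{e}$ is $\Delta e_i = t$. The new value of $e_i$ will be

$$e_i \leftarrow e_i + \frac{1}{4K^2_{ii}}\left[2KX'\mathbf{w} - 2K^2\mathbf{e} + \boldsymbol{\beta} - \gamma\mathbf{j}\right]_i,$$

again ensuring that $0 \leq e_i \leq 1$.

Algorithm 1 The SCCA algorithm

input: Data matrix $X \in \mathbb{R}^{m \times \ell}$, Kernel matrix
$K \in \mathbb{R}^{\ell \times \ell}$ and the value $k$.

% Initialisation:
$\mathbf{w} = \mathbf{0}$, $\mathbf{j} = \mathbf{1}$, $\mathbf{e} = \mathbf{0}$, $e_k = 1$
$\mu = \frac{1}{m}\sum_{i}^{m} |(2XK\mathbf{e})_i|$, $\gamma = \frac{1}{\ell}\sum_{i}^{\ell} |(2K^2\mathbf{e})_i|$
$\boldsymbol{\alpha}^- = 2XK\mathbf{e} + \mu\mathbf{j}$
$I = (\boldsymbol{\alpha} < 0) \,\|\, (\boldsymbol{\alpha} > 2\mu\mathbf{j})$

repeat
    % Update the found weight values:
    Converge over $\mathbf{w}$ using Algorithm 2

    % Find the dual values that are to be updated
    $\boldsymbol{\beta} = 2K^2\mathbf{e} - 2KX'\mathbf{w} + \gamma\mathbf{j}$
    $J = (\boldsymbol{\beta} < 0)$

    % Update the found dual projection values
    Converge over $\mathbf{e}$ using Algorithm 3

    % Find the weight values that are to be updated
    $\boldsymbol{\alpha} = 2XK\mathbf{e} - 2XX'\mathbf{w} + \mu\mathbf{j}$
    $I = (\boldsymbol{\alpha} < 0) \,\|\, (\boldsymbol{\alpha} > 2\mu\mathbf{j})$
until convergence

Output: Feature directions $\mathbf{w}, \mathbf{e}$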

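Algorithm 2 The SCCA algorithm - Convergence over $\mathbf{w}$

repeat
    for $i = 1$ to length of $I$ do
        if $\alpha_{I_i} > 2\mu$ then
            $\alpha_{I_i} = 2\mu$
            $\hat{w}_{I_i} \leftarrow w_{I_i} + \frac{1}{2(XX')_{I_i,I_i}}\left[2(XK\mathbf{e})_{I_i} - 2(XX'\mathbf{w})_{I_i} - \alpha_{I_i} + \mu\right]$
        else if $\alpha_{I_i} < 0$ then
            $\alpha_{I_i} = 0$
            $\hat{w}_{I_i} \leftarrow w_{I_i} + \frac{1}{2(XX')_{I_i,I_i}}\left[2(XK\mathbf{e})_{I_i} - 2(XX'\mathbf{w})_{I_i} - \alpha_{I_i} + \mu\right]$
        else
            if $w_{I_i} > 0$ then
                $\hat{w}_{I_i} \leftarrow w_{I_i} - \frac{2\mu - \alpha_{I_i}}{2(XX')_{I_i,I_i}}$
            else if $w_{I_i} < 0$ then
                $\hat{w}_{I_i} \leftarrow w_{I_i} + \frac{\alpha_{I_i}}{2(XX')_{I_i,I_i}}$
            end if
        end if
        if $\mathrm{sign}(w_{I_i}) \neq \mathrm{sign}(\hat{w}_{I_i})$ then
            $w_{I_i} = 0$
        else
            $w_{I_i} = \hat{w}_{I_i}$
        end if
    end for
until convergence over $\mathbf{w}$

Algorithm 3 The SCCA algorithm - Convergence over $\mathbf{e}$

repeat
    for $i = 1$ to length of $J$ do
        if $J_i \neq k$ then
            $e_{J_i} \leftarrow e_{J_i} + \frac{1}{4K^2_{J_i,J_i}}\left[2(KX'\mathbf{w})_{J_i} - 2(K^2\mathbf{e})_{J_i} - \gamma\right]$
            if $e_{J_i} < 0$ then
                $e_{J_i} = 0$
            else if $e_{J_i} > 1$ then
                $e_{J_i} = 1$
            end if
        end if
    end for
until convergence over $\mathbf{e}$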
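
A single pass of the inner loop of Algorithm 3 might look like the sketch
below. It follows our reading of the update rule for $e_i$ (with $\beta_i$
taken as zero for the selected indices) and clips the result to $[0, 1]$;
the function name `e_sweep` and the toy inputs are hypothetical, and the
outer "repeat until convergence" loop is left out.

```python
import numpy as np

def e_sweep(X, K, w, e, k, gamma, J):
    """One pass over the indices in J, following our reading of Algorithm 3."""
    e = e.copy()
    K2 = K @ K
    KXw = K @ (X.T @ w)
    for i in J:
        if i == k:
            continue                                  # e_k stays fixed at 1
        step = (2 * KXw[i] - 2 * (K2[i] @ e) - gamma) / (4 * K2[i, i])
        e[i] = min(max(e[i] + step, 0.0), 1.0)        # keep 0 <= e_i <= 1
    return e

# Toy example with identity matrices (hypothetical values).
X = np.eye(3)
K = np.eye(3)
w = np.array([0.0, 1.0, 0.0])
e0 = np.array([1.0, 0.0, 0.0])

e1 = e_sweep(X, K, w, e0, k=0, gamma=0.5, J=[0, 1, 2])
```

In this example the second coefficient moves to $(2 - 0.5)/4 = 0.375$,
while the third would become negative and is therefore set back to zero.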



4.2 SCCA Algorithm 

Observe that in the initial condition when $\mathbf{w} = \mathbf{0}$ from
equations (6) we are able to treat the scale parameters $\mu, \gamma$ as dual
Lagrangian variables and set them to

$$\mu = \frac{1}{m}\sum_{i}^{m} |(2XK\mathbf{e})_i|, \qquad \gamma = \frac{1}{\ell}\sum_{i}^{\ell} |(2K^2\mathbf{e})_i|.$$

We emphasise that this is to provide an underlying automatic determination
of sparsity and may not be the optimal setting although we show in Section
5.1 that this method works well in practice. Combining all the pieces we
give the SCCA algorithm as pseudo-code in Algorithm 1, which takes $k$ as a
parameter. In order to choose the optimal value of $k$ we would need to run
the algorithm with all values of $k$ and select the one giving the best
objective value. This would be chosen as the first feature.
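
The automatic setting of the scale parameters can be sketched as below,
assuming (as in our reading of the initialisation of Algorithm 1) that
$\mu$ and $\gamma$ are the mean absolute values of $2XK\mathbf{e}$ and
$2K^2\mathbf{e}$ at the starting point $e_k = 1$. The function name
`automatic_hyperparameters` and the toy matrices are assumptions.

```python
import numpy as np

def automatic_hyperparameters(X, K, k):
    """mu = mean |2XKe|, gamma = mean |2K^2 e| for e the k-th unit vector."""
    m, l = X.shape
    e = np.zeros(l)
    e[k] = 1.0                                  # initial condition e_k = 1
    mu = np.abs(2 * X @ K @ e).mean()           # average over the m features
    gamma = np.abs(2 * K @ (K @ e)).mean()      # average over the l samples
    return mu, gamma

# Toy example (hypothetical values): m = 2 features, l = 2 samples.
X = np.array([[1.0, -2.0],
              [3.0, 4.0]])
K = np.eye(2)

mu, gamma = automatic_hyperparameters(X, K, k=1)
```

With these inputs $2XK\mathbf{e} = [-4, 8]$ and $2K^2\mathbf{e} = [0, 2]$, so
the sketch returns $\mu = 6$ and $\gamma = 1$.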

To ensure orthogonality of the extracted features (Shawe-Taylor &
Cristianini, 2004) for each $\mathbf{e}_j$ and corresponding $\mathbf{w}_j$,
we compute the residual matrices $X_j$, $j = 1, \ldots, \ell$ by projecting the
columns of the data onto the orthogonal complement of
$X_j'(X_jX_j'\mathbf{w}_j)$, a procedure known as deflation,

$$X_{j+1} = X_j(I - \mathbf{u}_j\mathbf{p}_j'),$$

where $U$ is a matrix with columns $\mathbf{u}_j = X_jX_j'\mathbf{w}_j$ and $P$
is a matrix with columns
$\mathbf{p}_j = \frac{X_jX_j'\mathbf{u}_j}{\mathbf{u}_j'X_jX_j'\mathbf{u}_j}$.
The extracted projection directions can be computed (following
(Shawe-Taylor & Cristianini, 2004)) as $U(P'U)^{-1}$. Similarly we deflate for
the dual view, where $\boldsymbol{\tau}_j = K_j(K_j\mathbf{e}_j)$, and compute
the projection directions as $B(T'KB)^{-1}T$ where $B$ is a matrix with columns
$K_j\mathbf{e}_j$ and $T$ has columns $\boldsymbol{\tau}_j$. The deflation
procedure is illustrated in pseudocode in Algorithm 4; for a detailed review
of deflation we refer the reader to (Shawe-Taylor & Cristianini, 2004).

Algorithm 4 The SCCA algorithm with deflation

input: Data matrix $X \in \mathbb{R}^{m \times \ell}$, Kernel matrix
$K \in \mathbb{R}^{\ell \times \ell}$.

$X_1 = X$, $K_1 = K$
for $j = 1$ to $\ell$ do
    $k = j$
    $[\mathbf{e}_j, \mathbf{w}_j] = \mathrm{SCCA\_Algorithm1}(X_j, K_j, k)$
    if $j < \ell$ then
        $X_{j+1} = X_j(I - \mathbf{u}_j\mathbf{p}_j')$
    end if
end for

Checking each value of $k$ at each iteration is computationally impractical.
In our experiments we adopt the very simplistic strategy of picking the
values of $k$ in numerical order $k = 1, \ldots, \ell$. Clearly, there exist
intermediate options of selecting a small subset of values at each stage,
running the algorithm for each and selecting the best of this subset. This
and other extensions of our work will be the focus of future studies.



5 Experiments 

In the following experiments we use two paired English-French and
English-Spanish corpora. The English-French corpus consists of 300 samples
with 2637 English features and 2951 French features while the
English-Spanish corpus consists of 1,000 samples with 40,629 English
features and 57,796 Spanish features. The features represent the number of
words in each language. Both corpora are pre-processed into a Term
Frequency Inverse Document Frequency (TFIDF) representation followed by
zero-meaning (centring) and normalisation. The linear kernel was used for
the dual view. The best test performance for the KCCA regularisation
parameter for the paired corpora was found to be 0.03. We used this value
to ensure that KCCA was not at a disadvantage since SCCA had no parameters
to tune.
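
The pre-processing pipeline could be sketched as follows. The paper does not
state which TFIDF variant or which normalisation was used, so the
logarithmic inverse document frequency, the per-feature centring and the
unit-length scaling of each document below are all assumptions, as is the
function name `tfidf_centre_normalise`.

```python
import numpy as np

def tfidf_centre_normalise(counts):
    """counts: (features x documents) matrix of raw word counts."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)            # document frequency per word
    idf = np.log(n_docs / np.maximum(df, 1))         # assumed IDF variant
    X = counts * idf[:, None]                        # TFIDF weighting
    X = X - X.mean(axis=1, keepdims=True)            # centre each feature
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)       # unit-length documents

# Toy corpus: 3 words (rows) in 3 documents (columns), hypothetical counts.
counts = np.array([[2.0, 0.0, 1.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 1.0, 1.0]])

X = tfidf_centre_normalise(counts)
K = X.T @ X                                          # linear kernel for the dual view
```

Note that a word appearing in every document receives zero weight under
this IDF choice, and that centring makes the representation dense.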

5.1 Hyperparameter Validation 

In the following section we demonstrate that the proposed approach for
automatically determining the regularisation parameter (hyper-parameter)
$\mu$ (or alternatively $\gamma$) is sufficient for our purpose. The SCCA
problem

$$\min_{\mathbf{w},\mathbf{e}} \|X'\mathbf{w} - K\mathbf{e}\|^2 + \mu\|\mathbf{w}\|_1 + \gamma\|\tilde{\mathbf{e}}\|_1, \tag{7}$$

subject to $\|\mathbf{e}\|_\infty = 1$ can be simplified to a general LASSO
solver by removing the optimisation over $\mathbf{e}$, resulting in

$$\min_{\mathbf{w}} \|X'\mathbf{w} - \mathbf{k}\|^2 + \mu\|\mathbf{w}\|_1,$$

where, given our paired data, $\mathbf{k}$ is the inner product between the
query and the training samples and $X$ is the second paired data samples.
This simplified formulation is trivially solved by Algorithm 1 by ignoring
the loops that adapt $\mathbf{e}$. The simplification of equation (7) allows
us to focus on showing that $\mu$ is close to optimal, which is also true
for $\gamma$, and therefore omitted.
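
For readers who want to experiment with the simplified problem, the sketch
below solves $\min_{\mathbf{w}} \|X'\mathbf{w} - \mathbf{k}\|^2 + \mu\|\mathbf{w}\|_1$
with a standard cyclic coordinate descent using soft-thresholding. This is
a generic LASSO solver swapped in for illustration, not the paper's
Algorithm 1; the function names, the number of sweeps and the synthetic data
are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, k, mu, sweeps=500):
    """Minimise ||X'w - k||^2 + mu * ||w||_1 by cyclic coordinate descent."""
    A = X.T                                   # rows are samples, columns are features
    m = A.shape[1]
    w = np.zeros(m)
    col_sq = (A ** 2).sum(axis=0)
    residual = k - A @ w
    for _ in range(sweeps):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue
            residual += A[:, j] * w[j]        # remove feature j's contribution
            # Exact minimiser over w_j; the threshold is mu/2 because the
            # squared error carries no factor of 1/2.
            w[j] = soft_threshold(A[:, j] @ residual, mu / 2.0) / col_sq[j]
            residual -= A[:, j] * w[j]
    return w

# Synthetic problem with a sparse ground truth (hypothetical data).
rng = np.random.default_rng(2)
m, l = 5, 40
X = rng.standard_normal((m, l))
w_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
k = X.T @ w_true + 0.01 * rng.standard_normal(l)

w_hat = lasso_coordinate_descent(X, k, mu=1.0)
# A sufficiently large mu forces the all-zero solution.
w_zero = lasso_coordinate_descent(X, k, mu=2.1 * np.abs(X @ k).max())
```

Increasing `mu` drives more entries of the solution to exactly zero, which
is the behaviour examined in this section as a function of $\mu$.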

The hyper-parameters control the level of sparsity. Therefore, we test the
level of sparsity as a function of the hyper-parameter value. We proceed by
creating a new document $d^*$ from a paired language that best matches our
query^9 and observe how the change in $\mu$ affects the total number of words
being selected. An "ideal" $\mu$ would generate a new document, in the paired
language, that selects the same number of words as the query's actual paired
document. Recall that the data has been mean corrected (centred) and is
therefore no longer sparse.

^9 i.e. given a query in French we want to generate a document in English
that best matches the query. The generated document can then be compared to
the actual paired English document.

We set $\mu$ to be in the range $[0.001, \ldots, 1]$ with an increment of 0.001
and use a leave-paired-document-out routine for the English-French corpus,
which is repeated for all 300 documents. Figure 1 illustrates, for a single
query, the effect of the change in $\mu$ on the level of sparsity. We plot the
ratio of the total number of selected words to the total number of words in
the original document. An ideal choice of $\mu$ would give a ratio of 1 (the
horizontal line), i.e. create a document with exactly the same number of
words as the original document, or in other words select a $\mu$ such that
the cross would lie on the plot. We are able to observe that the method for
automatically choosing $\mu$ (the vertical line) is able to create a new
document with a close approximation to the total number of words in the
original document.




Figure 1: Document generation for the English-French corpus (visualisation
for a single query), with different $\mu$ values on the horizontal axis: We
plot the ratio of total number of selected words to the total number of
words in the original document. The horizontal line defines the "ideal
choice" where the total number of selected words is identical to the total
number of words in the original document. The vertical line represents the
result using the automatic setting of the hyper-parameter. We are able to
observe that the automatic selection of $\mu$ is a good approximation for
selecting the level of sparsity.

In Table 1 we are able to show that the average ratio of total number
of selected words for each document generated in the paired language is
very close to the "ideal" level of sparsity, while a non-sparse method (as
expected) generates a document with an average of approximately 28 times
the number of words from the original document. Now that we have
established the automatic setting of the hyper-parameters, we proceed in
testing how 'good' the selected words are in the form of a mate-retrieval
experiment.

Table 1: French-English corpus: The ratio of the total number of selected
words to the actual total number of words in the paired test document,
averaged over all queries. The optimal average ratio if we always generate
an 'ideal' document is 1.

Method                      Average Selection Ratio
Automatic setting of μ      1.01 ± 0.54
Non-sparse method           28.15 ± 15.71
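
The selection ratio reported in Table 1 can be illustrated with a minimal sketch. It assumes that a "selected word" is any vocabulary entry with a non-zero weight in the generated document, and that the word count of the original document is taken from its raw (uncentred) term vector. The function name `selection_ratio` and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np

def selection_ratio(generated, original, tol=0.0):
    """Ratio of words selected in the generated document to words in the original.

    generated : weight vector of the generated document (one entry per word)
    original  : raw (uncentred) term vector of the actual paired document
    tol       : weights with magnitude <= tol are treated as not selected
    """
    n_selected = np.count_nonzero(np.abs(generated) > tol)
    n_original = np.count_nonzero(original)
    return n_selected / n_original

# Toy data (hypothetical): three queries over a six-word vocabulary.
generated_docs = np.array([
    [0.4, 0.0, -0.2, 0.0, 0.0, 0.0],   # 2 words selected
    [0.0, 0.1,  0.3, 0.0, 0.5, 0.0],   # 3 words selected
    [0.2, 0.0,  0.0, 0.0, 0.0, 0.0],   # 1 word selected
])
original_docs = np.array([
    [1, 0, 2, 0, 0, 0],                # 2 words present
    [0, 1, 0, 0, 1, 0],                # 2 words present
    [3, 0, 0, 1, 0, 0],                # 2 words present
])

ratios = np.array([selection_ratio(g, o)
                   for g, o in zip(generated_docs, original_docs)])

# Mean and standard deviation over all queries, as in the "mean ± std" entries.
mean_ratio, std_ratio = ratios.mean(), ratios.std()
print(f"{mean_ratio:.2f} ± {std_ratio:.2f}")
```

A ratio of 1 corresponds to the "ideal" document; a dense (non-sparse) weight vector selects every word in the vocabulary and therefore gives a ratio far above 1.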



5.2 Mate Retrieval 

Our experiment is of mate-retrieval, in which a document from the test cor- 
pus of one language is considered as the query and only the mate document 
from the other language is considered relevant. In the following experiments 
the results are an average of retrieving the mate for both English and French 
(English and Spanish) and have been repeated 10 times with a random train- 
test split. 

We compute the mate-retrieval by projecting the query document as well
as the paired (other language) test documents into the learnt semantic space
where the inner product between the projected data is computed. Let q be
the query in one language and K_s the kernel matrix of the inner product
between the second language's testing and training documents; then

$$ I = \left\langle \frac{q'w}{\|q'w\|}, \frac{K_s e}{\|K_s e\|} \right\rangle. $$
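
The scoring step above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: we assume the primal projections are stacked as columns of a matrix `W`, the dual projections as columns of `E`, and that the normalisation is applied to each projected document separately. The function name `mate_scores`, the argument names and the toy data are hypothetical.

```python
import numpy as np

def mate_scores(q, W, K_s, E, eps=1e-12):
    """Normalised inner products between one query and every candidate mate.

    q   : (n_words,)         query document in the primal view
    W   : (n_words, k)       primal projections, one column per projection
    K_s : (n_test, n_train)  kernel between the second language's test
                             and training documents
    E   : (n_train, k)       dual projections, one column per projection
    """
    q_proj = q @ W                      # query in the semantic space, shape (k,)
    c_proj = K_s @ E                    # candidates in the semantic space, (n_test, k)
    q_proj = q_proj / max(np.linalg.norm(q_proj), eps)
    c_norm = np.maximum(np.linalg.norm(c_proj, axis=1, keepdims=True), eps)
    return (c_proj / c_norm) @ q_proj   # one score per candidate, shape (n_test,)

# Toy data (hypothetical): 3 words, 2 training and 2 test documents, 2 projections.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
q = np.array([3.0, 4.0, 0.0])
K_s = np.eye(2)
E = np.array([[3.0, 4.0],
              [4.0, -3.0]])

scores = mate_scores(q, W, K_s, E)
```

In this toy case the first candidate projects onto the same direction as the query and the second onto an orthogonal direction, so the first candidate would be retrieved as the mate.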

The resulting inner products I are then sorted by value. We measure the
success of the mate-retrieval task using average precision, which assesses where
the correct mate is located within the sorted inner products I. Let I_j be the
index location of the retrieved mate from query q_j; the average precision p is
computed as

$$ p = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{I_j}, $$

where M is the number of query documents.
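
As a minimal sketch of this measure, we assume that, with a single relevant document per query, average precision reduces to the mean of the reciprocal ranks 1/I_j of the mates. The matrix layout (query j's mate stored at column j), the optimistic handling of ties and the function names are illustrative assumptions.

```python
import numpy as np

def mate_ranks(scores):
    """1-based rank I_j of each query's mate in its sorted list of inner products.

    scores[j, i] is the inner product between query j and candidate i;
    the mate of query j is assumed to be candidate j.
    """
    mate = np.diag(scores)
    # Rank = 1 + number of candidates scoring strictly higher than the mate
    # (ties are resolved in favour of the mate).
    return 1 + np.sum(scores > mate[:, None], axis=1)

def average_precision(ranks):
    """Mean reciprocal rank of the mates over all M queries."""
    return float(np.mean(1.0 / ranks))

# Toy data (hypothetical): three queries, three candidates.
scores = np.array([
    [0.9, 0.1, 0.2],   # mate (0.9) is ranked first
    [0.8, 0.5, 0.1],   # mate (0.5) is ranked second
    [0.3, 0.7, 0.6],   # mate (0.6) is ranked second
])

ranks = mate_ranks(scores)
p = average_precision(ranks)
error = 1.0 - p        # the quantity plotted as "average precision error"
```

Here the ranks are 1, 2 and 2, so p = (1 + 1/2 + 1/2) / 3 = 2/3.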

We start by giving the results for the English-French mate-retrieval, as shown
in Figure 2. The left plot depicts the average precision (± standard
deviation) when 50 documents are used for training and the remaining 250 are
used as test queries. The right plot in Figure 2 gives the average precision (±
standard deviation) when 100 documents are used for training and the
remaining 200 for testing. It is interesting to observe that even though SCCA
does not learn the common semantic space using all the features (average
plotted in Figure 3) for either the ML primal or dual views (although SCCA will
use the full dual features when using the full number of projections), its error
is extremely similar to that of KCCA and in fact converges with it when a
sufficient number of projections is used. It is important to emphasise that
KCCA uses the full number of documents (50 and 100) and the full number of
words (an average of 2794 for both languages) to learn the common semantic
space. For example, following the left plot in Figure 2 and the additional
plots in Figure 3, we are able to observe that when 35 projections are used
KCCA and SCCA show a similar error. However, SCCA uses approximately
142 words and 42 documents to learn the semantic space, while KCCA uses
2794 words and 50 documents.
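
Counts such as "142 words and 42 documents" can be obtained from the learnt projections. The sketch below assumes that a word (or document) counts as "used" when it has a non-zero weight in at least one of the first k projections; this union-over-projections reading, the function name `features_used` and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def features_used(P, n_projections, tol=0.0):
    """Number of rows of P with a non-zero weight in any of the first
    n_projections columns.

    P : (n_features, k) matrix with one sparse projection per column;
        rows are words for the primal view, documents for the dual view.
    """
    active = np.any(np.abs(P[:, :n_projections]) > tol, axis=1)
    return int(active.sum())

# Toy data (hypothetical): 5 words and 4 training documents, 3 projections.
W = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.2, 0.0],
              [0.0, 0.0, 0.0],
              [0.1, 0.0, 0.3]])
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

words_used = [features_used(W, k) for k in (1, 2, 3)]
docs_used = [features_used(E, k) for k in (1, 2, 3)]
```

A dense method such as KCCA would give the full number of rows for every k, which is the reference level quoted for the sparsity figures.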

The second mate-retrieval experiment uses the English-Spanish paired
corpus. In each run we randomly split the 1000 samples into 100 training and
900 testing paired documents. The results are plotted in Figure 4, where we
are clearly able to observe SCCA outperforming KCCA throughout. We
believe this to be a good example of when too many features hinder the learnt
semantic space, which also explains the difference from the results obtained on
the English-French corpus, as the number of features is significantly smaller
in that case. The average level of SCCA sparsity is plotted in Figure 5. In
comparison to KCCA, which uses all words (49,212), SCCA uses a maximum
of 460 words.

The performance of SCCA, especially in the latter English-Spanish
experiment, shows that we are indeed able to extract meaningful semantics between
the two languages, using only the relevant features.




Figure 2: English-French: The average precision error (1 - p) with ± standard
deviation error bars for SCCA and KCCA for different numbers of projections
used for the mate-retrieval task. The left figure is for 50 training and 250
testing documents while the right figure is for 100 training and 200 testing
documents.

Despite these already impressive results, our intuition is that even better
results are attainable if the hyper-parameters were tuned to give
optimal results. The question of hyper-parameter optimality is left for future
research. It seems, however, that the main gain of SCCA is the sparsity and
interpretability of the features.

6 Conclusions 

Despite being introduced in 1936, CCA has proven to be an inspiration for
new and continuing research. In this paper we analyse the formulation of
CCA and address the issues of sparsity as well as convexity by presenting
a novel SCCA method formulated as a convex least squares approach. We
also provide a different perspective on solving CCA by using an ML
primal-dual formulation, which focuses on the scenario when one is interested in (or
limited to) an ML-primal representation for the first view while having an
ML-dual representation for the second view. A greedy optimisation algorithm is
derived.

Figure 3: English-French: Level of Sparsity. This figure is an
extension of Figure 2 which uses 50 documents for training. The left figure
plots the average number of words used while the right figure plots the average
number of documents used against the number of projections used (the x-axis
of both plots). For reference, KCCA uses all the words (average of 2794) and
documents (50) for any number of projections.

Figure 4: English-Spanish: The average precision error (1 - p) with ± standard
deviation error bars of SCCA and KCCA for different numbers of projections
used (x-axis) for the mate-retrieval task. We use 100 documents for training
and 900 documents for testing.

Figure 5: English-Spanish: Level of Sparsity. This figure is an
extension of Figure 4 which uses 100 documents for training. The left figure
plots the average number of words used while the right figure plots the
average number of documents used with increasing number of projections.
For reference, KCCA uses all the words (average of 49,212) and documents
(100) for any number of projections.

The method is demonstrated on bilingual English-French and English-Spanish
paired corpora for mate retrieval. The true capacity of SCCA becomes
visible when the number of features becomes extremely large, as SCCA
is able to learn the common semantic space using a very sparse representation
of the ML primal-dual views.

The paper's raison d'être was to propose a new efficient algorithm for solving
the sparse CCA problem. We believe that, while addressing this problem, new
and interesting questions which need to be addressed have surfaced:

• How can the hyper-parameters μ and γ be computed automatically so as to
achieve optimal results?

• How do we set k for each when we wish to compute less than ^ 
projections? 

• Extending SCCA to a ML primal-primal (ML dual-dual) framework. 

We believe this work to be an initial stage for a new sparse framework to be
explored and extended.